[GMLP-9206] Backport redis-py PR #3557 recursion fix#3
Open
waiho-gumloop wants to merge 4 commits into
Open
Conversation
Workaround for the on_connect-recursion bug in redis-py < 6.0.0b2 that crashed ~261 Celery workers across all services during the production Redis broker pod replacement on 2026-06-25 (incident GMLP-9206). Root cause: when a redis-py Connection is mid-handshake and the inner PING from check_health fails, send_packed_command calls self.connect() to recover, which re-enters on_connect -> CLIENT SETINFO -> check_health -> PING -> send_packed_command -> connect(), recursing until RecursionError crashes the worker. The upstream fix (redis-py PR celery#3557) splits connect into connect_check_health and threads check_health=False through the inner reconnect. It is in redis-py 6.0.0b2+ only and is NOT backported to the 5.x line. We cannot adopt the upstream fix without a coordinated upgrade: - kombu 5.5.x pins redis<=5.2.1 - celery 5.5.x pins kombu<5.6 - the gumloop-celery fork would need to be rebased on celery 5.6+ (which also requires forward-porting rbehal/celery's spawn pool work and our --disable-prefetch / GMLP-9012 PRs) Until that upgrade lands, this patches Connection.send_packed_command to raise ConnectionError instead of silently calling self.connect() when the socket is gone. Higher-level callers (kombu Channel, redis.client, application pools) already handle ConnectionError with proper retry, so the failure surfaces cleanly. Patch is applied via import side effect from celery/__init__.py and is a no-op when: - redis-py is not installed - redis-py >= 6.0.0b2 (detected via hasattr Connection, 'connect_check_health') References: - redis-py PR celery#3557: redis/redis-py#3557 - redis-py issue celery#3745: redis/redis-py#3745 - Linear GMLP-9206
The imported module's own docstring already explains what it does.
…vability log Replaces the minimal patch (which raised ConnectionError to break the cycle) with a faithful port of redis-py PR celery#3557's actual logic: * Adds Connection.connect_check_health(check_health: bool = True) * Adds Connection.on_connect_check_health(check_health: bool = True) * Rebinds Connection.connect() and Connection.on_connect() as backwards-compatible wrappers * In Connection.send_packed_command, replaces the reconnect branch 'if not self._sock: self.connect()' with 'self.connect_check_health(check_health=False)' so the inner reconnect does not re-fire check_health -> _send_ping -> send_packed_command -> connect -> ... (the recursion cycle) Preserves the upstream contract: a transient broker blip is recovered transparently inside send_packed_command, the outer caller's command succeeds against the freshly-established connection. Application code using redis-py directly (e.g. cache clients) no longer sees a ConnectionError on every blip, matching the behavior callers expect from redis-py. Adds one operational concession to upstream: a single WARNING per worker process the first time the suppressed-check_health reconnect fires, with subsequent reconnects logged at DEBUG. Restores a small observability signal for incident triage without log spam. Validated on the beta cluster (deterministic in-pod reproducer, 5/5 runs show recursion broken, observability WARNING fires correctly, control flow matches the expected 4-attempt retry shape).
Point to the specific line ranges of the AbstractConnection.connect and AbstractConnection.on_connect methods that this module ports, so reviewers can diff the port against the exact upstream source.
rbehal
approved these changes
Jul 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds
celery/_redis_py_recursion_workaround.py, imported fromcelery/__init__.py, that backports redis-py PR #3557 forredis-py < 6.0.0b2. Faithful port of the upstream fix:Connection.connect_check_healthandConnection.on_connect_check_health.check_healththrough everysend_commandin theon_connectchain.send_packed_command's stale-socket reconnect fromself.connect()toself.connect_check_health(check_health=False).Why
On 2026-06-26T03:46:35Z–03:46:50Z, 171 Celery worker pods in production crashed with
RecursionError('maximum recursion depth exceeded while calling a Python object')during a Redis DeploymentRecreaterollout. Confirmed via prod cluster inspection: Redis ReplicaSetredis-6f9c8ffc9fwas created at2026-06-26T03:45:56Z, matching the first worker "Connection to broker lost" log at 03:45:55.820Z.The bug is in
redis-py 5.2.1's reconnect path (upstream issue #3745, fixed in PR #3557):Trigger conditions (all set in every Celery service's
broker_transport_options):health_check_interval = 30retry_on_timeout = Truelib_namenon-empty (redis-py defaults it to"redis-py")Failure-mode agnostic: the recursion fires regardless of whether the initial
_send_pingfailure isTimeoutError,ConnectionError,BusyLoadingError, or a raw socket error. This fix breaks the cycle at the reconnect point, so all these variants surface as normal handleable errors.Why not just upgrade redis-py
kombu==5.5.2pinsredis<=5.2.1.celery 5.5.xpinskombu<5.6. A redis-py bump alone is rejected byuv sync. Proper fix is a coordinatedcelery 5.6+ / kombu 5.6+ / redis-py 6.0+upgrade, which also requires rebasing this fork onto upstream celery 5.6+ (forward-porting rbehal's spawn pool work, PR #1--disable-prefetch, PR #2 GMLP-9012). That is tracked strategically. This PR is the tactical mitigation.Idempotency / no-op conditions
hasattr(Connection, 'connect_check_health')) → no-opValidation
Two reproduction methods worked on the beta cluster:
kubectl execa Python script into a worker pod that monkeypatchesredis.connection.Connection._send_pingto always raiseTimeoutErrorafter sending PING. Thenclient.ping()against live beta Redis. Unpatched → 5/5RecursionError. Patched → 5/5 non-recursion.DEBUG SLEEP+ tracer script: temporarily enabled--enable-debug-command yeson the beta Redis Deployment, thenredis-cli DEBUG SLEEP 180(backgrounded) + a tracer script inside a worker pod usingsocket_timeout=0.2so each recursion level ≈ 200 ms. Unpatched →RecursionError max_depth=996in 13.42 s. Patched → non-recursionTimeoutErroratmax_depth=29in 0.81 s. Redis deployment reverted after.5.5.3+gumloop.0.1.5RecursionError5.5.3+gumloop.0.1.5.gmlp9206.test1DEBUG SLEEP 180+socket_timeout=0.2)RecursionErroratmax_depth=996in 13.42sTimeoutErroratmax_depth=29in 0.81sTest wheel
5.5.3+gumloop.0.1.5.gmlp9206.test1is on beta cluster now. Do not promote that tag to staging/prod. Cut a clean5.5.3+gumloop.0.1.6on merge.Caveats
celeryis imported (celery/__init__.pyapplies the patch as import side-effect). For our Celery broker code paths this is universally true.References
max clients reached): Infinity retry on the exception of max number of clients reached redis/redis-py#2563